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In this paper, we propose a novel method to select significant variables and estimate the corre¬ 
sponding coefficients in multiple-index models with a group structure. All existing approaches 
for single-index models cannot be extended directly to handle this issue with several indices. 
This method integrates a popularly used shrinkage penalty such as LASSO with the group- 
wise minimum average variance estimation. It is capable of simultaneous dimension reduction 
and variable selection, while incorporating the group structure in predictors. Interestingly, the 
proposed estimator with the LASSO penalty then behaves like an estimator with an adaptive 
LASSO penalty. The estimator achieves consistency of variable selection without sacrificing the 
root-n consistency of basis estimation. Simulation studies and a real-data example illustrate the 
effectiveness and efficiency of the new method. 
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1. Introduction 

Suppose that E G R is a univariate response and X = (Ai,.. .,Xp)^ G is a vector 
of predictors. A general goal of regression analysis is to characterize the conditional 
distribution of Y given X, or the conditional mean E(E|X). The theory of sufficient 
dimension reduction (Li [17] and Cook and Weisberg [10]) provides a framework for 
reducing the dimension of X while preserving information on regression. Let S denote 
a subspace of R^, and let P5 denote the orthogonal projection onto S with respect to 
the usual inner product. If Y and X are independent conditioned on PsX, then we say 
that 5 is a dimension reduction subspace. The intersection of all such subspaces, if itself 
satisfies the conditional independence, is defined to be the central subspace (Cook [6] and 
Yin, Li and Cook [38]). When only the mean response E(E|X) is of interest, sufficient 
dimension reduction can be defined in a similar fashion. Specifically, a subspace S is said 
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to be a mean dimension reduction subspace if Y is independent of £i(F|X) given P 5 X. 
If the intersection of all mean dimension reduction subspaces is also a mean dimension 
reduction subspace, it is called the central mean subspace (Cook and Li [8]). In either 
case, sufficient dimension reduction permits us to restrict attention to a number d < p 
of new predictors, expressed as linear combinations of the original ones: (3j X,..., /3JX, 
where {/S^, ■ • ■,/3d} is a basis of S. 

In the last two decades or so, a series of papers have considered issues related to 
dimension reduction in regression. There have primarily been two categories of estimation 
methods in the literature: inverse regression methods (Li [17], Cook and Weisberg [10] 
and Cook and Ni [9]) and direct regression methods (Hardle and Stoker [14], Xia et al. 
[34] and Yin and Li [37]). Inverse regression methods, despite being computationally 
simple and widely used, require strong assumptions on predictors such as the linearity 
condition (Li [17]), and often fail to estimate the central subspace exhaustively (Cook 
[6]). In contrast, the minimum average variance estimation (MAVE) method of Xia et 
al. [34] has proven effective in dimension reduction and estimation of complicated semi- 
parametric models. The root-n consistency is still achievable for the MAVE estimate. 
Compared with other direct regression methods, the calculation for MAVE is much easier, 
and many efficient algorithms are available. Although MAVE was originally proposed for 
dimension reduction for the conditional mean, the idea was recently generalized to target 
the central subspace (Wang and Xia [28] and Yin and Li [37]). In this article, we are 
concerned mainly with predictors in the conditional mean. 

Dimension reduction is a fundamental statistical problem in both theory and practice. 
The aforementioned dimension reduction methods, however, suffer from the difficulty of 
interpreting the results, because the new extracted predictors usually involve all of the 
original ones. To handle this problem, model-free variable selection, in the framework of 
sufficient dimension reduction, has attracted considerable attention in recent years. For 
example, Li, Cook and Nachtsheim [18] introduced test-based procedures, Bondell and 
Li [1] incorporated inverse regression estimation with LASSO (Tibshirani [25]) to obtain 
shrinkage inverse regression estimation, and Chen, Zou and Cook [5] proposed a unified 
method called coordinate-independent sparse estimation. See also Zhu et al. [41] and 
Wang, Xu and Zhu [31]. All these methods, which are largely “parametric” in nature, 
are based on inverse regression methods and thus suffer the drawbacks of strong design 
assumptions and poor finite-sample performance (Wang, Xu and Zhu [30]). 

Exploring the idea of combining MAVE and LASSO, Wang and Yin [29] proposed 
a sparse MAVE method and Zeng, He and Zhu [39] designed for single-index models 
a lasso-type approach called sim-lasso. Because the sparse MAVE penalizes the index 
vectors directly, it is not a principled method for variable selection and only provides a 
sparse estimate for a basis matrix of the central mean subspace column by column. The 
use of the h penalty function in Zeng, He and Zhu [39] is novel in that it penalizes the 
index vector and the norm of the derivative of link function simultaneously. However, 
the theoretical properties of sim-lasso, such as its consistency and convergence rate, have 
not yet been studied due to the interaction between the bandwidth and the penalty 
parameter. Further, it is nontrivial, if not impossible, to extend sim-lasso to deal with 
multiple-index models. Several papers have addressed the problem of semi-parametric 
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variable selection for single-index models, and developed large sample properties. See, for 
instance, Liang et al. [20], Peng and Huang [21] and Wang, Xu and Zhu [30]. However, 
condition (vi) in Liang et al. [20] may not hold true and their approach could not be 
extended to handle multiple-index models. The penalized MAVE method in Wang, Xu 
and Zhu [30] was motivated by the reasoning that predictor selection can be realized 
through selection of nonvanishing rows of a basis matrix of the central mean subspace. 
A bridge penalty function was employed to penalize the h norms of the rows of a basis 
matrix. Although the penalized MAVE performs well for multiple-index models in the 
numerical studies, its theoretical properties are established only for the special case of 
single-index models. This is because, condition (C5) in Wang, Xu and Zhu [30], which 
is also assumed in Peng and Huang [21], is hard to check and possibly invalid except for 
single-index models. To the best of our knowledge, semi-parametric variable selection for 
multiple-index models has thus far not been well studied. 

In many engineering and scientific situations, however, predictors are naturally 
grouped. Eor example, in biological applications assayed genes or proteins can be grouped 
by biological pathways. Although useful, existing dimension reduction methods are 
generic and treat all predictors in X indiscriminately. To take advantage of such group 
knowledge, Li, Li and Zhu [19] proposed a group-wise sufficient dimension reduction 
method, called group-wise MAVE, which preserves full regression information in the 
conditional mean of Y given X while exploiting the group structure among predictors. 
Generally, it is believed that incorporating group information into dimension reduction 
can facilitate interpretation of results and improve estimation accuracy as the number of 
unknown parameters has been greatly reduced. 

As a simple illustration, we use an example to show the necessity of group- 
wise dimension reduction and variable selection. Consider a response model Y = 
/37Xi -b sin(0.27t/3jX2)-b 0.5e, where Xi e X 2 G /3i = (1,-1,0,.. .,0)^, 
(32 = (1,1,0 ,..., 0)^, and all predictors and e are independent standard normal vari¬ 
ables. Write X = (X]'^,Xj)^. Then the central mean subspace for ^(VjX) is spanned 
by (/37.0 io)^ and (07o,/3j)^, where Oio is a 10 x 1 vector of zeros. We then should rule 
out zeros and identify (3i and (32 or their linear combinations. A single but represen¬ 
tative simulated data set with 150 observations was obtained, and the MAVE direction 
estimates were 


(-0.624,0.647,0.006,0.057,0.034, -0.010, -0.013,0.033,0.023, -0.022, 


- 0.275, -0.316, -0.002, -0.062, -0.057,0.010,0.005, -0.034, -0.028, -0.017)^ 


and 


(0.141,0.379, -0.394,0.005,0.030, -0.313, -0.313,0.146,0.201, -0.341, 

0.106, -0.022, -0.286,0.111,0.096, -0.047, -0.303, -0.073, -0.203, -0.226)^. 

MAVE treats all predictors in X indiscriminately. While the first direction estimate seems 
reasonable, the second one is very poor, and thus the overall estimation accuracy must 
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be poor. Given the prior information that Xi and X 2 are two predictor groups, we apply 
group-wise MAVE, and the resulting estimates of [3i and ^2 respectively, given by 

(0.698, -0.715, -0.035, -0.031, -0.010, -0.003, -0.001,0.010, -0.010,0.015)^ 


and 


(0.717,0.666, -0.035,0.155,0.002,0.032, -0.008,0.111, -0.024, -0.058)^. 

A substantial gain in accuracy has been achieved by incorporating the predictor group 
information. Nevertheless, in each group all the predictors are included in the extracted 
linear combination, although some coefficients are small, obscuring the fact that only the 
first two predictors are contributing factors. It is obvious that group-wise MAVE cannot 
be the base for both dimension reduction and variable selection. Therefore, a selection 
operator also plays an important role, and we will see that a shrinkage penalty will be 
useful for us to use group-wise MAVE to exclude irrelevant predictors from the model. 

Two main features of this paper are listed below. 

1. We consider the problem of semi-parametric variable selection for multiple-index 
regression models. Although multiple-index models are popular in the statistics 
and econometrics literature, little work has been done on variable selection. We 
propose a shrinkage MAVE estimator by introducing a shrinkage factor for each 
row of an estimated basis matrix of the central mean subspace. Eor multiple-index 
models the proposed estimator is proved to be consistent in variable selection while 
retaining the root-n consistency. However, although the estimation problem can 
be reformulated as a LASSO problem in spirit, the LASSO problem under study 
has an asymptotically singular design matrix (Knight and Eu [16]). This is because 
the MAVE procedure is a combination of nonparametric function estimation and 
direction estimation. This makes the theoretical investigation more complicated. 
To deal with this issue for single-index models, condition (C5) is assumed in Wang, 
Xu and Zhu [30], otherwise, the large sample properties are difficult to derive. For 
multiple-index models, the standard approach of LASSO with nonsingular designs 
fails to show the large sample properties. Therefore, in this paper, the results of 
mixed-rates asymptotics (Radchenko [22]) are adopted to derive the asymptotic 
behavior even the design matrix is asymptotically singular. This is a new skill about 
proving the asymptotics of the LASSO estimation for semi-parametric models. The 
interaction between the bandwidth and the penalty parameter now is explicitly 
shown in Theorem 2.1. 

2. We propose a general knowledge-based method that accounts for prior group infor¬ 
mation. As we have explained before, the group structure leads to a reduction in the 
total number of parameters. Consequently, our method, which is motivated by and 
derives from dimension reduction, doubly alleviates the “curse of dimensionality”. 
As a by product, such a structure also makes the computation more efficient. 

The paper is organized as follows. In Section 2.1, we review the group-wise mini¬ 
mum average variance estimation. In Section 2.2, we combine group-wise MAVE with 
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the LASSO penalty, as an example, to propose a shrinkage group-wise MAVE estima¬ 
tor. This method does not require any restrictive design assumptions, and is capable of 
simultaneous dimension reduction and variable selection. The asymptotic properties of 
the new estimator are established in Section 2.3. We also use a criterion, which has the 
same form as the Bayesian information criterion (BIG; Schwarz [23]), to select the opti¬ 
mal tuning parameter. Moreover, we establish the consistency of the resulting BIC-type 
selector. Numerical studies are presented in Section 3. As many shrinkage penalties can 
also be applied, we then include the simulation results with two other penalties as well. 
All technical proofs are relegated to the Appendix. 

2. Methodology 

We begin with some basic notations and terminology. For a positive integer to, Im stands 
for the TO X TO identity matrix. For an toi x TO 2 matrix A, span(A) represents the column 
space of A and Pa represents the orthogonal projection onto span(A). For a subspace 
S of R™, if A is a matrix of full column rank and span(A) = S, then we call A a basis 
matrix of S. Moreover, P 5 represents the projection onto S, that is, P 5 =Pa, where 
A is any basis matrix of S. For an rn-dimensional vector w = {wi,... ,Wm)^, diag(w) 
denotes a diagonal matrix whose diagonal entries starting in the upper left corner are 
wi,..Wm ■ We use Ai 0 • • • 0 Ag, or simply A;, to denote a block diagonal matrix 
with matrices Ai,..., Ag on the diagonal. 

2.1. A short review 

In this subsection, we review group-wise dimension reduction for the regression mean 
function and the group-wise minimum average variance estimation. We refer the reader 
to Li, Li and Zhu [19] for more details. 

Let Si,.. .,Sg be subspaces of W that form an orthogonal decomposition of that 
is, = 5i 0 • • • 0 5g, where 0 denotes the direct sum operator. If there are subspaces 
Ti Q Si for I = I,..., g such that E{Y\X.) = E(Y\Pji,. ■ ■, ^Tg), then we say that 71 0 
•••07^ is a group-wise mean dimension reduction subspace with respect to {iSi,..., Sg}. 
Under very mild conditions (Yin, Li and Cook [38]), the intersection of all group-wise 
mean dimension reduction subspaces, with respect to a given orthogonal decomposition 
{iSi,..., iSg}, exists uniquely. We call this subspace the group-wise central mean subspace 
and denote it as SE{Y\yi)iSi, ■ ■ ■ ,Sg). By definition, 

SEiYmiSi,...,Sg)=Ti*(B---(BTg* 

for some subspaces Tf C 5i,..., 7^* C 5g. Let pi , di and d denote the dimensions of Si , T* 

and 5 b(y|x)('?i, ... ,Sg), respectively. Then we havep =pi H-hpg and d = diA -hdg. 

Let P; G RP^P' be a basis matrix of Si, and let V; = P;^X G RP' . We note that compo¬ 
nents of V; correspond to predictors in group I, and all the group information contained 
in Pi’s is available as prior knowledge. By construction, there are matrices B* G RP'’^'^' 
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for l = .,g such that span(riB*) = T* ■ Write B* = B*. We are interested in 

estimating B* or its column space span(B*). 

Li, Li and Zhu [19] proposed the group-wise MAVE estimator such that the matrix 
B* = B* is the minimizer of 

A{r-A(r|B7Vi,...,B;Vg)f 

with respect to Bi G , B^ G , subject to Bj^B; = 1^, for / = 1 ,... , 5 . 

Let V = (V^,..., Vj)^ G RP and B = B,. Then we have 

E{Y - E(Y\Bj Vi,..., BjVg)}^ = E{al{B^V)}, 

where cr|(B^V) = E;[{F - £;(r|B7Vi,..., BjVg)}2|B7Vi,..., BjVg] is the condi¬ 
tional variance of Y given B^ Vi,..., B J Vg. 

Suppose that {(y®, v®),i = 1,... ,n} is a random sample from (F, V). Extending the 
MAVE idea, we can use local linear smoothing to estimate tT 3 (B^V). Specifically, for 
any given G R^, we have the following approximation 

n 

4 (B^v°) « - A(r|B^V = 

j=i 

j=l [ 1=1 ) 


where rc^’s are kernel weights such that + Sf=i b°^B7(v^ — v°) is 

the local linear expansion of E(y|B^V = B^v-l) at v°. 

Consequently, we can recover the group-wise central mean subspace by minimizing the 
objective function 


n n f g 

EE 2/^-«*-EbrB7K 



( 2 . 1 ) 


with respect to a® G R, bj G R®^®,..., b® G R®^®, f = 1,..., n, and B; G R^' with B^B; = 
Id, for Z = 1,..., y. To allow the estimation to be adaptive to the regression structure, we 
follow the idea of refined MAVE (Xia et al. [34] and Li, Li and Zhu [19]) and adopt the 
weights 

, iC,{BT(v^-v®)} 

E;=i^MBT(W-v®)}’ 


where Kh{-) is a d-dimensional kernel with bandwidth h, and B is taken to be the current 
or latest estimate. 
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The minimization problem in (2.1) can be solved by fixing (a*, ,bp, i = 1,... ,n, 

and fixing alternatively. Thus, the calculation can be decomposed into two op¬ 

timization problems both of which have simple analytic solutions. The details of the 
group-wise MAVE algorithm can be found in Section 3.2 of Li, Li and Zhu [19]. Let 
denote the group-wise minimum average variance estimator. 


2.2. Shrinkage group-wise minimum average variance estimation 


The group-wise MAVE method captures the full regression information in E{Y\X.) while 
preserving the group structure in X. Specifically, it can provide a consistent estimator 
of B*D° for some d x d nonsingular matrix D° = D°, where D° £ for I = 

I,... ,g. However, the elements of B;’s are usually nonzero. Consequently, the extracted 
predictor vector B^V; corresponding to group I consists of linear combinations of all 
the predictors in that group. When there are a large number of predictors, one would 
expect that only a subset of predictors are relevant to the response variable. Write V = 
(Vi,..., VpY' . According to Proposition 1 of Cook [7], Vg is irrelevant if and only if the 
sth row of B* is a zero vector. Eurther, it is easy to see that for any d x d nonsingular 
matrix D, when a row of B* is zero, the corresponding row of B*D is also zero, and 
vice versa. These observations motivate us to employ the state-of-the-art methods for 
simultaneous shrinkage estimation and variable selection, such as LASSO, to design a 
sparse version of the group-wise MAVE procedure which shrinkages some rows of B to 
be exactly zero vectors. 

Define 



jL4B^(vJ-V)} 


i,j = 


For each * = 1,..., n, let (a*, b)^,..., b^) be the minimizer of 


nr g 

3=1 I l=l 



( 2 . 2 ) 


In the sequel, we shall use an updated version of the group-wise minimum average vari¬ 
ance estimator, B = B;, which is the minimizer of 







rBr(v^ 



(2.3) 


Definition 2.1. A shrinkage group-wise minimum average variance estimator is defined 
as 
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B = 0diag(Q:i)Bi, 
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where the shrinkage index vectors og = {&ii ,..., G for I = 1,... ,g are deter¬ 

mined by minimizing 


EE 



^bJ^{diag(ai)B;}^(vf 

Z=1 



(2.4) 


with respect to cti = {an,... ,aipff' € R^', I = . ,g, subject to X)f=i SsLi ^ P 

for some > 0. 


To solve the above optimization problem, we note that (2.4) can be re-expressed as 

. . ^ . IV 

yJ - h* - E diag(v^ - v])ai I . 

i=i j=i t 1=1 J 

Equivalently, the shrinkage index vectors minimize 

n n e g _ g Pi 

diag(vf - -vDai I w* -|- A„ EE |a;s| (2.5) 

i=l j=l [ 1=1 ) 1=1 s=l 

for some tuning parameter A„ > 0. As a result, commonly-used LASSO algorithms, such 
as those of Efron et al. [11] and Eriedman, Hastie and Tibshirani [13], can be applied to 
obtain the shrinkage index vectors A; for I = 1,..., g. 

When Tn > p, the indices ais = 1 for all! = 1,..., y and s = 1,... ,pi, and so B reduces to 
the usual group-wise MAVE estimator B. As t„ gradually decreases, some of the indices 
are shrunk to zero, which means some rows of B are zero; that is, the corresponding 
predictors are irrelevant to the response variable given the other predictors. 



2.3. Asymptotic theory 

We next study the large-sample properties of the proposed method. Eor an mi x m 2 ma¬ 
trix A, we say that A is row-sparse if some of its rows are zero. Let T(A) C {1,... ,m} 
denote the subset of indices corresponding to nonzero rows of A. Clearly, the notion of 
row-sparseness is nonsingular-transformation independent, since for any m 2 x m 2 non¬ 
singular matrix 0, 1(A) =I(AO). Suppose that B* = B* is row-sparse. Without 
loss of generality, we assume that for I = the hrst qi rows of B* are nonzero, 

that is, T(B*) = {1,... ,qi}. The following theorem concerns the asymptotic behavior of 
shrinkage group-wise MAVE. 

Theorem 2.1. Suppose that the regularity conditions (A1)-(A6) given in the Appendix 
hold. // A„ —>■ 00 and —>■ 0, then we have 

(1) selection consistency: P{I(Bj) = I(B*), / = 1,..., y} —1, and 
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(2) root-n consistency: B; = B*D° + Op(n for I = 1,... ,g. 

Theorem 2.1, part (1), demonstrates that the shrinkage group-wise MAVE method can 
efficiently remove unimportant predictors, while part ( 2 ) implies that the estimator that 
corresponds to relevant predictors is root-n consistent. As we can see, the result is very 
similar to that of adaptive LASSO for linear models (Zou [42]). In fact, we shall show 
in the proof that shrinkage group-wise MAVE is closely related to an adaptive LASSO 
problem. A similar phenomena can be found in Bondell and Li [1] where they studied the 
shrinkage inverse regression estimation. However, unlike linear models, we need to study 
the interplay between the bandwidth h and the penalty parameter A. This is explicitly 
shown in Theorem 2.1 in which we require that A —>■ oo and —?► 0. We also 

note that, although it is possible to derive the asymptotic distribution, the form of the 
asymptotic variance is rather complicated and thus is not pursued here. 

As a direct application we consider the special case when 3 = 1 , that is, there is no 
group information available. It follows that the shrinkage MAVE estimator possesses 
exactly the same properties. 

Corollary 2 . 1 . Assume that g = l, and that the regularity conditions (A1)-(A6) given 
in the Appendix hold. // A„ —^ oo and Xnn~^^^h~^ —>■ 0, then we have 

(1) selection consistency: P{I(B) =I(B*)}—> 1, and 

(2) root-n consistency: B = B*D° -)- 

The attractive properties of shrinkage group-wise MAVE depend critically on an ap¬ 
propriate choice of the tuning parameter, for which prediction based criteria such as 
generalized cross-validation have been commonly used in practice. However, it is well 
known that this practice tends to produce over-fitted models. For model selection consis¬ 
tency, it has been verified that tuning parameter selectors with the Bayesian information 
criterion are able to identify the true model consistently; see for example Wang, Li and 
Tsai [27] and Wang and Leng [26]. In the following, we propose a criterion which is similar 
in form to the classical Bayesian information criterion. 

Let a(A) = ■ • ■, ■ Write q;(A) = (di,..., ap)~^ and V = (lA,..., Vp)~^. We use 

the notation A4 = {ri, r 2 ,..., Cp*} C { 1 ,... ,p} to denote an arbitrary candidate model 

which includes predictors {14, s G AI}. Let fci = 0 and ki=pi-\ - \-pi-i for Z = 2 ,..., g. 

Then, Alp = (1,... ,p} and Mr = Uf=i{^i + ^, ■ ■ ■ ,ki -\- qi} represent the full model and 
the true model, respectively. Finally, we use jAI] to denote the size of the model A4. 

Let AJa = (s : ds 0} be the model that is identified by q:(A) or B. Define 



We select the optimal A by minimizing 


BICA = l0 g(RSSA)+dfA 


log(n) 


n 


( 2 . 6 ) 
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where cHa denotes the effective number of parameters in the shrinkage group-wise MAVE 
estimator. The resulting optimal regularization parameter is denoted by Abic- Following 
the discussion of Zou, Hastie and Tibshirani [43] about the degrees of freedom of the 

LASSO estimator, we approximate dfA by di\M\\ H-|-dg|A4®|, where M\ represents 

the index set of identified predictors in group 1. 

We now establish the asymptotic property of the BIC-type tuning parameter selector. 

Theorem 2.2. Suppose that the regularity conditions (A1)-(A6) given in the Appendix 
hold. Then we have =Mt) —11- 

Remarks. 

1. Mixed-rates behavior naturally arises in the estimation of semi-parametric mod¬ 
els. As shown in the proof of Theorem 2.1, the objective function (2.5) can be 
decomposed into two components with different convergent rates. As a result, the 
standard approach does not yield the complete limiting behavior of the estimator. 
Fortunately, we are able to derive the asymptotic behavior by directly applying 
results from mixed-rates asymptotics (Radchenko [22]). 

2. In practice, one may use a concave penalty other than the LASSO penalty. We 
have tried using the smoothly clipped absolute deviation penalty (Fan and Li [12]) 
and the minimax concave penalty (Zhang [40]), and have found that the resulting 
estimators enjoy the same properties. See Section 3 for a numerical comparison of 
these methods. Consider again the illustrative example in Section 1, the proposed 
sparse group-wise MAVE method, when the smoothly clipped absolute deviation 
penalty is used, yielded the direction estimates 

(0.702,-0.712,0,...,0)T and (0.722,0.692,0,0.027,0,..., 0)^. 

As we can see, all except one of the coordinates corresponding to irrelevant predic¬ 
tors were correctly shrunk to zero. 

3. The result here is applicable to a general class of semi-parametric models. In par¬ 
ticular, it provides an alternative method for estimation and selection for partially 
linear single-index models in which two groups exist naturally (Xia and Hardle 
[33]). Further, the new method can be adjusted to handle dimension reduction and 
variable selection with censored data (Xia, Zhang and Xu [35]). 

4. Although in this paper we focus on shrinkage estimation of the group-wise central 
mean subspace, the same strategy can be used to target the group-wise central 
subspace. To see this, we note that Wang and Xia [28] have modified MAVE to 
estimate the central subspace, and so group-wise MAVE can be modified in a similar 
way to estimate the group-wise central subspace; see Section 8 of Li, Li and Zhu 
[19] for more discussion. To conclude, we believe that these efforts would enhance 
the usefulness of the shrinkage MAVE method in data analysis. 


Estimation for multi-index models 

3. Numerical studies 


11 


3.1. Simulation studies 

In this subsection, we use simulations to evaluate the finite-sample performance of the 
shrinkage group-wise MAVE method. For comparison we consider the LASSO penalty, the 
smoothly clipped absolute deviation (SCAD) penalty and the miniinax concave penalty 
(MCP) in the simulation. The resulting estimators, including group-wise MAVE, are 
denoted by SgMAVE-LASSO, SgMAVE-SCAD, SgMAVE-MCP and gMAVE, respec¬ 
tively. Throughout the following numerical studies we adopt the Gaussian kernel and use 
the optimal bandwidth h= {4/((i-I-The R code that we used for 
group-wise MAVE is available at http://www4.stat.ncsu.edU/-li/software.h.tml. 
SgMAVE-LASSO is computed using the least angle regression algorithm (Efron et al. 
[11]), while SgMAVE-SCAD and SgMAVE-MCP are computed using the coordinate de¬ 
scent algorithms described by Breheny and Huang [2] . The entire R code can be requested 
from the authors. 

To evaluate estimation accuracy, we compute the vector correlation coefficient (VCC), 
which is defined as (Ofli i s-nd the trace correlation coefficient (TCC), which is de¬ 
fined as {di~^ where the cj)^'s are the eigenvalues of the matrix . 

These two measures range between 0 and 1, with larger values indicating a more accurate 
estimator; see Ye and Weiss [36] for more information. We also employ three summary 
statistics to assess how well the methods select predictors: the average model size (MS), 
which is the average number of nonzero rows of Bq the true positive rate (TPR), which 
is the average fraction of nonzero rows of B; associated with relevant predictors; and the 
false positive rate (FPR), which is the average fraction of nonzero rows of B; associated 
with irrelevant predictors. Both TPR and FPR range between 0 and 1, and ideally, we 
wish to have TPR to be close to 1 and FPR to be close to 0 at the same time. We report 
the results using the BIC-type criterion (2.6) to select tuning parameters. 

The predictor vector V = (Vj'^,..., Vj)''' is generated from N{Op, S) in each example. 
We examine two commonly-used correlation structures among the predictors. The first 
is autoregressive, Egt = for all s,f = 1 ,... ,p. Consequently, the predictors with 

large distances in order are expected to be mutually independent approximately. The 
second is compound symmetry, Ess = 1 and Est = 0.5 for any s ^ f, so all the predictors 
are equally correlated with each other. 

Example 3.1. In this experiment, we set 


{g,Pi,P2,di,d2,qi,q2) = (2,20,20,1,1,3,2). 

Thus, there are two groups, Vi and V 2 , and each group consists of twenty predictors. 
Further, each predictor group is connected with the response variable through a sin¬ 
gle linear combination. Specifically, the response variable is generated from each of the 
following three models: 

Y = /37Vi(l+/3jV2) + 0.5e, 


(3.1) 
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Y = (3l Vi/{0.5 + (1.5 + (3l Va)^} + 0.5e, (3.2) 

Y = exp(0.5/3]^ Vi) + sin(0.27t/3j V 2 ) + 0.5e, (3.3) 

where = (1,1,1,0,..., 0)^, /32 = (1,1,0,..., 0)^, e ~ iV(0,1), and e is independent of 
all predictors. We let n = 200. 

Table 1 presents the simulation results based on 200 data replications for these three 
models. As we can see, all methods considered show very good performance, but the 
shrinkage ones often achieve higher estimation accuracy than the one without shrinkage. 
Further, although none of the three shrinkage methods can universally dominate the other 
two competitors, SgMAVE-SCAD and SgMAVE-MCP tend to produce sparser solutions 
than SgMAVE-LASSO. Finally, the performance of the group-wise MAVE estimator and 
its shrinkage versions is only slightly affected by the correlation structure among the 
predictors. 

Example 3.2. In this experiment, we set 


{g,Pi,P2,di,d2,qi,q2) = (2,20,20,2,1,91,2). 

Thus, there are two groups, Vi and V 2 , and each group consists of twenty predictors. 
Eurther, the first predictor group is connected with the response variable through two lin¬ 
ear combinations and the second predictor group is connected with the response variable 
through a single linear combination. The regression model is 

r = 2.5/37 iVi/{0.5 + (1.5 + + (3JY2 + 0.5e, (3.4) 

where = (1,1,0,..., 0)^, /32 = (1,1,0,... ,0)^, e ~ N{0, 1), and e is independent of all 
predictors. We consider two cases. In Case 1: we set 91 = 2 and (3^2 = (1, —1,0,..., 0)'''. 
In Case 2: we set 91 = 4 and / 3 i 2 = (0,0,1,1,0,.. .,0)^. We let n = 200. 

Table 2 summarizes the simulation results out of 200 data replications for Case 1 and 
Case 2. As in the previous example, we have the same observations. Unreported results 
also show that the BIC-type criterion has a pretty large rate of correctly identifying the 
true model in this example. 

Example 3.3. In this experiment, we set 

{9,Pi,P2,P3,di,d2,d3,qi,q2,q3) = (3,po,Po,4'o, 1,1,1, 2, 2, 2). 

Thus, there are three groups, Vi, V 2 and V 3 , and each group consists of po predictors. 
Further, each predictor group is connected with the response variable through a single 
linear combination. We consider the following two models: 

Y = ^JYi+ 2f3j V2/{0.5 + (1.5 + f3j V 3 )"} + 0.5£, 

Y = f3jYi+ 0.2(2 + /3j V 2 )" + 2 sin(0.27t/3j V 3 ) + 0.5e, 


(3.5) 

(3.6) 


Table 1. Summary of Example 3.1. The average vector correlation coefficient (VCC) with standard error in parentheses, the 
average number of predictors selected (MS), true positive rate (TPR) and false positive rate (FPR), based on 200 data replications, 
are reported 



/3i 




O 2 




VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 




Model (3.1): autoregressive correlation 




gMAVE 

0.9963 (0.0239) 




0.9905 (0.0780) 




SgMAVE-LASSO 

0.9895 (0.0996) 

4.1550 

0.9900 

0.1692 

0.9896 (0.0997) 

3.1850 

0.9900 

0.1506 

SgMAVE-SCAD 

0.9976 (0.0207) 

3.0450 

1.0000 

0.0064 

0.9898 (0.0997) 

2.0150 

0.9900 

0.0043 

SgMAVE-MCP 

0.9977 (0.0193) 

3.1200 

1.0000 

0.0171 

0.9897 (0.0997) 

2.0550 

0.9900 

0.0093 




Model (3.1): compound symmetry 




gMAVE 

0.9923 (0.0704) 




0.9818 (0.1108) 




SgMAVE-LASSO 

0.9933 (0.0709) 

4.8450 

0.9950 

0.2657 

0.9808 (0.1254) 

4.0300 

0.9900 

0.2562 

SgMAVE-SCAD 

0.9935 (0.0710) 

3.1400 

0.9950 

0.0221 

0.9811 (0.1252) 

2.1300 

0.9875 

0.0193 

SgMAVE-MCP 

0.9934 (0.0710) 

3.1950 

0.9950 

0.0300 

0.9805 (0.1294) 

2.1050 

0.9825 

0.0175 




Model 

(3.2): autoregressive correlation 




gMAVE 

0.9771 (0.0137) 




0.9735 (0.0538) 




SgMAVE-LASSO 

0.9885 (0.0104) 

5.5350 

1.0000 

0.3621 

0.9846 (0.0477) 

4.1400 

0.9975 

0.2681 

SgMAVE-SCAD 

0.9915 (0.0103) 

3.7300 

1.0000 

0.1042 

0.9849 (0.0557) 

2.5700 

0.9950 

0.0725 

SgMAVE-MCP 

0.9886 (0.0116) 

4.0100 

1.0000 

0.1442 

0.9837 (0.0550) 

2.5750 

0.9950 

0.0731 




Model (3.2): compound symmetry 




gMAVE 

0.9739 (0.0177) 




0.9432 (0.1535) 




SgMAVE-LASSO 

0.9856 (0.0120) 

5.7650 

1.0000 

0.3950 

0.9450 (0.1940) 

3.7450 

0.9625 

0.2275 

SgMAVE-SCAD 

0.9896 (0.0130) 

3.8300 

1.0000 

0.1185 

0.9486 (0.1923) 

2.3500 

0.9600 

0.0537 

SgMAVE-MCP 

0.9858 (0.0132) 

4.0500 

1.0000 

0.1500 

0.9438 (0.2031) 

2.3000 

0.9550 

0.0487 
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Table 1. (Continued) 



/3i 




(^2 




VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 




Model (3.3): autoregressive correlation 




gMAVE 

0.9955 (0.0026) 




0.9648 (0.0214) 




SgMAVE-LASSO 

0.9981 (0.0019) 

5.2600 

1.0000 

0.3228 

0.9879 (0.0158) 

3.6250 

1.0000 

0.2031 

SgMAVE-SCAD 

0.9984 (0.0019) 

3.8250 

1.0000 

0.1178 

0.9832 (0.0726) 

2.6850 

0.9950 

0.0868 

SgMAVE-MCP 

0.9981 (0.0020) 

3.6450 

1.0000 

0.0921 

0.9874 (0.0191) 

2.5650 

1.0000 

0.0706 




Model (3.3): compound symmetry 




gMAVE 

0.9954 (0.0023) 




0.9546 (0.0401) 




SgMAVE-LASSO 

0.9974 (0.0019) 

5.8050 

1.0000 

0.4007 

0.9766 (0.0384) 

3.9250 

1.0000 

0.2406 

SgMAVE-SCAD 

0.9981 (0.0019) 

4.0600 

1.0000 

0.1514 

0.9792 (0.0517) 

2.8800 

0.9975 

0.1106 

SgMAVE-MCP 

0.9977 (0.0021) 

3.7850 

1.0000 

0.1121 

0.9786 (0.0499) 

2.5950 

0.9975 

0.0750 
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Table 2. Summary of Example 3.2. The average of the vector correlation coefficient (VCC) and the trace correlation coefficient 
(TCC) with standard errors in parentheses, the average number of predictors selected (MS), true positive rate (TPR) and false 
positive rate (FPR), based on 200 data replications, are reported 



/3i — (/3ii 1 1 ^ 12 ) 





-92 




VCC 


TCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 




Model (3.4); Case 1, autoregressive correlation 




gMAVE 

0.9506 

(0.0972) 

0.9762 (0.0397) 




0.9667 (0.0249) 




SgMAVE-LASSO 

0.9936 

(0.0707) 

0.9978 (0.0207) 

3.4150 

1.0000 

0.1271 

0.9867 (0.0997) 

2.2450 

0.9900 

0.0147 

SgMAVE-SCAD 

0.9916 

(0.0772) 

0.9968 (0.0256) 

2.4350 

1.0000 

0.0755 

0.9917 (0.0734) 

2.0900 

0.9925 

0.0058 

SgMAVE-MCP 

0.9897 

(0.0774) 

0.9973 (0.0158) 

2.6900 

1.0000 

0.0889 

0.9912 (0.0734) 

2.1700 

0.9925 

0.0102 




Model (3.4): 

Case 1, 

compound 

symmetry 




gMAVE 

0.9515 

(0.0709) 

0.9760 (0.0312) 




0.9616 (0.0191) 




SgMAVE-LASSO 

0.9950 

(0.0170) 

0.9975 (0.0085) 

4.4250 

1.0000 

0.1802 

0.9933 (0.0224) 

2.9200 

0.9975 

0.0513 

SgMAVE-SCAD 

0.9953 

(0.0195) 

0.9976 (0.0097) 

2.9250 

1.0000 

0.1013 

0.9958 (0.0213) 

2.1700 

0.9975 

0.0097 

SgMAVE-MCP 

0.9879 

(0.0302) 

0.9940 (0.0149) 

3.4700 

1.0000 

0.1300 

0.9933 (0.0136) 

2.5150 

1.0000 

0.0286 




Model (3.4); Case 2, autoregressive correlation 




gMAVE 

0.9548 

(0.0764) 

0.9775 (0.0348) 




0.9686 (0.0176) 




SgMAVE-LASSO 

0.9498 

(0.1843) 

0.9786 (0.0877) 

8.2150 

0.9787 

0.2687 

0.9887 (0.0739) 

4.6850 

0.9925 

0.1500 

SgMAVE-SCAD 

0.9723 

(0.1289) 

0.9891 (0.0428) 

6.2700 

0.9925 

0.1437 

0.9952 (0.0183) 

2.7600 

1.0000 

0.0422 

SgMAVE-MCP 

0.9797 

(0.0837) 

0.9907 (0.0319) 

5.6850 

0.9975 

0.1059 

0.9939 (0.0165) 

2.6150 

1.0000 

0.0341 




Model (3.4): 

Case 2, 

compound 

symmetry 




gMAVE 

0.9584 

(0.0559) 

0.9795 (0.0219) 




0.9648 (0.0167) 




SgMAVE-LASSO 

0.9827 

(0.0724) 

0.9923 (0.0228) 

8.8900 

0.9975 

0.3062 

0.9918 (0.0162) 

5.1550 

1.0000 

0.1752 

SgMAVE-SCAD 

0.9871 

(0.0722) 

0.9945 (0.0224) 

5.3550 

0.9975 

0.0853 

0.9962 (0.0108) 

2.5300 

1.0000 

0.0294 

SgMAVE-MCP 

0.9831 

(0.0727) 

0.9925 (0.0230) 

5.4800 

0.9975 

0.0931 

0.9934 (0.0146) 

2.5200 

1.0000 

0.0288 
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where = ( 1 ,- 1 , 0 ,..., 0 )^, = ( 1 , 1 , 0 ,..., 0 )^, /33 = (1,-1, 0 ,..., 0 )^, £~iV(0,l), 

and e: is independent of all predictors. We let {n,po) be (100,10), (200,20) and (200,30). 

The simulation results for models (3.5) and (3.6), based on the 200 data replications, 
are shown in Tables 3 and 4, respectively. In general, the results show that reasonably, 
increasing the sample size improves the performance, while increasing the dimension of 
predictors makes the performance worse. Moreover, the empirical performance of the 
shrinkage estimators relies on the initial estimator as expected. Thus, the development 
of a shrinkage estimation and variable selection method that depends less on the initial 
estimator can be practically useful, and we will work along this line in our future study. 

As we mentioned before, the MAVE procedure is a combination of nonparametric 
function estimation and direction estimation; it is an iterative procedure with each cycle 
consisting of two least squares problems. As we know, inverse regression based methods, 
which are largely “parametric” in nature, are simple and easy to use. Thus, the proposed 
approach is computationally more demanding than inverse regression based methods, 
especially when the sample size and the predictor dimension are very high. Table 5 shows 
the average CPU times, based on 200 data replications, for the shrinkage group-wise 
MAVE method (along with penalty parameter selection) for model (3.6) in Example 3.3. 
All algorithms are implemented as R language functions, and all timings were carried out 
on a Dell Poweredge R410 dual processors server equipped with Six Core Xeon X5670 
2.93 CHz CPU, 64 CB RAM running CentOS 5 Linux. We see that the times depend on 
both n and p. We also find similar results (unreported) for the other models considered in 
the simulation studies. Nevertheless, we emphasize that, as opposed to inverse regression 
based methods which require strong conditions on the distribution of predictors, direct 
regression based methods such as MAVE need relatively weak conditions such as the 
smoothness of the link function, and they often have much better performance for finite 
samples. 

3.2. Pyrimidine data 

A common step in drug design is the formation of a quantitative structure-activity re¬ 
lationship (QSAR; So [24]). The QSAR analysis is to relate a numerical description 
of molecular structure to known biological activity. The pyrimidine data set, which is 
available in the UCI machine-learning repository at http : //archive. ics .uci . edu/ml/ 
machine-learning-databases/qsar/, was studied by Hirst, King and Sternberg [15] to 
model the QSAR of the inhibition of dihydrofolate reductase (DHFR) by pyrimidines. 
It contains a structural information on 74 2,4-diamino-5-(substituted benzyl)pyrimidines 
used as inhibitors of DHFR in Escherichia coli. Each pyrimidine compound has 3 positions 
of substitution where chemical activity occurs, and at each position the substituent is as¬ 
signed nine physicochemical attributes: polarity (PL), size (SZ), flexibility (EL), number 
of hydrogen-bond donors (HD), number of hydrogen-bond acceptors (HA), strength and 
presence of 7 r-donors (ttD), strength and presence of 7 r-acceptors (ttA), polarisability of 
the molecular orbitals (PO) and cr-effect (ctE). The response variable is the experimen¬ 
tally assayed activity of the inhibitors. 


Table 3. Summary of Example 3.3. The average vector correlation coefficient (VCC) with standard error in parentheses, the 
average number of predictors selected (MS), true positive rate (TPR) and false positive rate (FPR), based on 200 data replications, 
for model (3.5), are reported 



/3i 





^2 




^3 




VCC 


MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 





Model (3.5): (n,po) = (100,10), 

autoregressive 

correlation 




gMAVE 

0.966 

(0.021) 




0.965 (0.023) 




0.956 (0.100) 




SgMAVE-LASSO 

0.967 

(0.106) 

5.850 

0.980 

0.486 

0.980 (0.025) 

7.060 

0.997 

0.633 

0.965 (0.127) 

6.075 

0.980 

0.514 

SgMAVE-SCAD 

0.952 

(0.170) 

4.230 

0.967 

0.286 

0.980 (0.021) 

4.745 

1.000 

0.343 

0.960 (0.147) 

4.225 

0.977 

0.283 

SgMAVE-MCP 

0.972 

(0.102) 

3.585 

0.987 

0.201 

0.979 (0.021) 

4.180 

1.000 

0.272 

0.960 (0.146) 

3.675 

0.980 

0.214 






Model (3.5): (n,po) = 

(100,10), compound symmetry 




gMAVE 

0.949 

(0.046) 




0.950 (0.049) 




0.888 (0.225) 




SgMAVE-LASSO 

0.935 

(0.169) 

5.135 

0.950 

0.404 

0.953 (0.128) 

6.270 

0.975 

0.540 

0.881 (0.282) 

5.245 

0.915 

0.426 

SgMAVE-SCAD 

0.952 

(0.130) 

3.950 

0.977 

0.249 

0.967 (0.052) 

4.560 

1.000 

0.320 

0.892 (0.258) 

3.875 

0.932 

0.251 

SgMAVE-MCP 

0.966 

(0.052) 

3.460 

0.995 

0.183 

0.961 (0.081) 

4.100 

0.995 

0.263 

0.888 (0.264) 

3.420 

0.922 

0.196 





Model (3.5): (n,po) = (200,20), 

autoregressive 

correlation 




gMAVE 

0.971 

(0.018) 




0.972 (0.014) 




0.967 (0.087) 




SgMAVE-LASSO 

0.950 

(0.198) 

4.425 

0.952 

0.140 

0.994 (0.004) 

8.840 

1.000 

0.380 

0.960 (0.184) 

5.645 

0.962 

0.206 

SgMAVE-SCAD 

0.972 

(0.142) 

4.545 

0.975 

0.144 

0.992 (0.009) 

6.060 

1.000 

0.225 

0.979 (0.121) 

4.900 

0.985 

0.162 

SgMAVE-MCP 

0.987 

(0.071) 

3.525 

0.995 

0.085 

0.989 (0.010) 

5.125 

1.000 

0.173 

0.978 (0.121) 

3.905 

0.985 

0.107 






Model (3.5): (n,po) = 

(200,20), compound symmetry 




gMAVE 

0.961 

(0.031) 




0.962 (0.028) 




0.926 (0.175) 




SgMAVE-LASSO 

0.926 

(0.235) 

3.505 

0.925 

0.091 

0.978 (0.104) 

7.505 

0.985 

0.307 

0.912 (0.269) 

4.550 

0.915 

0.151 

SgMAVE-SCAD 

0.942 

(0.213) 

3.625 

0.950 

0.095 

0.982 (0.073) 

5.080 

0.995 

0.171 

0.925 (0.241) 

3.990 

0.935 

0.117 

SgMAVE-MCP 

0.983 

(0.029) 

3.535 

1.000 

0.085 

0.980 (0.026) 

4.855 

1.000 

0.158 

0.926 (0.232) 

3.690 

0.935 

0.101 
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Table 3. (Continued) 



/3i 





(^2 





/33 





VCC 


MS 

TPR 

FPR 

VCC 


MS 

TPR 

FPR 

VCC 


MS 

TPR 

FPR 





Model (3. 

5): (n,po) = (200,30), 

autoregressive 

correlation 




gMAVE 

0.943 

(0.026) 




0.938 

(0.030) 




0.943 

(0.038) 




SgMAVE-LASSO 

0.793 

(0.375) 

2.895 

0.787 

0.047 

0.993 

(0.005) 

8.500 

1.000 

0.232 

0.874 

(0.308) 

4.200 

0.867 

0.088 

SgMAVE-SCAD 

0.913 

(0.257) 

4.485 

0.912 

0.095 

0.985 

(0.021) 

7.535 

1.000 

0.197 

0.977 

(0.121) 

5.285 

0.985 

0.118 

SgMAVE-MCP 

0.976 

(0.121) 

3.710 

0.985 

0.062 

0.979 

(0.022) 

6.790 

1.000 

0.171 

0.990 

(0.014) 

4.395 

1.000 

0.085 






Model (3.5): (n,po) = 

(200,30), compound symmetry 




gMAVE 

0.907 

(0.058) 




0.901 

(0.063) 




0.836 

(0.223) 




SgMAVE-LASSO 

0.828 

(0.333) 

2.350 

0.817 

0.025 

0.952 

(0.173) 

6.535 

0.960 

0.164 

0.827 

(0.353) 

3.295 

0.822 

0.058 

SgMAVE-SCAD 

0.848 

(0.328) 

2.930 

0.845 

0.044 

0.963 

(0.125) 

5.250 

0.982 

0.117 

0.837 

(0.347) 

3.750 

0.845 

0.073 

SgMAVE-MCP 

0.943 

(0.172) 

3.460 

0.965 

0.054 

0.960 

(0.039) 

6.240 

1.000 

0.151 

0.867 

(0.298) 

3.920 

0.887 

0.076 
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Table 4. Summary of Example 3.3. The average vector correlation coefficient (VCC) with standard error in parentheses, the 
average number of predictors selected (MS), true positive rate (TPR) and false positive rate (FPR), based on 200 data replications, 
for model (3.6), are reported 



/3i 





^2 




^3 




VCC 


MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 





Model (3.6): (n,po) = (100,10), 

autoregressive 

correlation 




gMAVE 

0.968 

(0.088) 




0.965 (0.069) 




0.979 (0.026) 




SgMAVE-LASSO 

0.962 

(0.154) 

6.970 

0.975 

0.627 

0.977 (0.071) 

7.410 

0.995 

0.677 

0.968 (0.140) 

7.105 

0.977 

0.643 

SgMAVE-SCAD 

0.973 

(0.120) 

4.570 

0.987 

0.324 

0.981 (0.028) 

4.685 

0.997 

0.336 

0.970 (0.139) 

4.685 

0.980 

0.340 

SgMAVE-MCP 

0.973 

(0.121) 

3.880 

0.985 

0.238 

0.978 (0.071) 

4.060 

0.995 

0.258 

0.970 (0.139) 

4.070 

0.980 

0.263 






Model (3.6): (n,po) = 

(100,10), compound symmetry 




gMAVE 

0.975 

(0.014) 




0.966 (0.052) 




0.978 (0.010) 




SgMAVE-LASSO 

0.985 

(0.014) 

6.535 

1.000 

0.566 

0.978 (0.053) 

6.815 

0.997 

0.602 

0.987 (0.010) 

6.825 

1.000 

0.603 

SgMAVE-SCAD 

0.987 

(0.014) 

4.030 

1.000 

0.253 

0.980 (0.071) 

3.945 

0.995 

0.244 

0.988 (0.011) 

4.320 

1.000 

0.290 

SgMAVE-MCP 

0.984 

(0.013) 

3.845 

1.000 

0.230 

0.977 (0.071) 

3.835 

0.995 

0.230 

0.985 (0.010) 

4.150 

1.000 

0.268 





Model (3.6): (n,po) = (200,20), 

autoregressive 

correlation 




gMAVE 

0.979 

(0.007) 




0.977 (0.009) 




0.983 (0.006) 




SgMAVE-LASSO 

0.997 

(0.003) 

5.815 

1.000 

0.211 

0.995 (0.004) 

8.215 

1.000 

0.345 

0.997 (0.003) 

5.935 

1.000 

0.218 

SgMAVE-SCAD 

0.996 

(0.006) 

4.505 

1.000 

0.139 

0.995 (0.006) 

4.705 

1.000 

0.150 

0.997 (0.005) 

4.570 

1.000 

0.142 

SgMAVE-MCP 

0.994 

(0.006) 

4.240 

1.000 

0.124 

0.993 (0.006) 

4.230 

1.000 

0.123 

0.995 (0.005) 

4.445 

1.000 

0.135 






Model (3.6): (n,po) = 

(200,20), compound symmetry 




gMAVE 

0.976 

(0.008) 




0.974 (0.011) 




0.980 (0.007) 




SgMAVE-LASSO 

0.997 

(0.002) 

3.845 

1.000 

0.102 

0.995 (0.004) 

6.340 

1.000 

0.241 

0.997 (0.003) 

4.315 

1.000 

0.128 

SgMAVE-SCAD 

0.996 

(0.006) 

3.170 

1.000 

0.065 

0.996 (0.005) 

3.370 

1.000 

0.076 

0.997 (0.005) 

3.330 

1.000 

0.073 

SgMAVE-MCP 

0.991 

(0.007) 

4.095 

1.000 

0.116 

0.989 (0.009) 

4.210 

1.000 

0.122 

0.992 (0.006) 

4.310 

1.000 

0.128 
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Table 4. (Continued) 



/3i 




(^2 




/33 




VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 

VCC 

MS 

TPR 

FPR 




Model (3.6): (n,po) = (200,30), 

autoregressive 

correlation 




gMAVE 

0.960 (0.013) 




0.951 (0.017) 




0.967 (0.012) 




SgMAVE-LASSO 

0.993 (0.070) 

3.680 

0.995 

0.060 

0.996 (0.003) 

7.470 

1.000 

0.195 

0.993 (0.070) 

3.840 

0.995 

0.066 

SgMAVE-SCAD 

0.997 (0.006) 

3.710 

1.000 

0.061 

0.996 (0.008) 

4.335 

1.000 

0.083 

0.998 (0.004) 

4.005 

1.000 

0.071 

SgMAVE-MCP 

0.995 (0.007) 

3.810 

1.000 

0.064 

0.992 (0.009) 

4.255 

1.000 

0.080 

0.995 (0.006) 

4.030 

1.000 

0.072 





Model (3.6): (n,po) = 

(200,30), compound symmetry 




gMAVE 

0.951 (0.014) 




0.943 (0.021) 




0.959 (0.015) 




SgMAVE-LASSO 

0.997 (0.003) 

2.895 

1.000 

0.031 

0.995 (0.003) 

6.340 

1.000 

0.155 

0.998 (0.003) 

3.010 

1.000 

0.036 

SgMAVE-SCAD 

0.998 (0.003) 

2.345 

1.000 

0.012 

0.998 (0.003) 

2.635 

1.000 

0.022 

0.998 (0.002) 

2.375 

1.000 

0.013 

SgMAVE-MCP 

0.989 (0.009) 

4.190 

1.000 

0.078 

0.986 (0.012) 

4.820 

1.000 

0.100 

0.991 (0.007) 

4.310 

1.000 

0.082 
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Table 5. Run times (CPU seconds) for shrinkage group-wise MAVE of various sizes n, p and 
different correlation structures among the predictors for model (3.6) in Example 3.3 



Autoregressive correlation 

Compound symmetry 

(n,p) = (100,30) 

14 

13 

(n,p) = (200,60) 

87 

99 

(n,p) = (200,90) 

186 

200 


The attributes in this data set naturally fall into 3 groups corresponding to the 
substitution positions 1, 2 and 3. This is further conhrmed by the graphical repre¬ 
sentation of the correlation matrix of the attributes in Figure 1; for example, the at¬ 
tributes belonging to the third substitution position have a moderately strong corre¬ 
lation, but are weakly associated with most of the other attributes. We write V/ = 
(PL/,SZi,FLi,HDi,HA/, 7 rDi, 7 rA/,POi,crEi)^ for / = 1,2 and 3, where the attributes 
are represented by their two-letter abbreviations with the subscripts denoting the po¬ 
sition of substitution. All predictors are standardized to have mean zero and unit length 
(the predictor ttAs has no variability and is then removed). Thus, in this data set, the 
sample size is n = 74, the predictor dimension is p = 26, and the group information is 
{g,Pi,P 2 ,P 3 ) = (3,9,9, 8 ). We regard this “prior” information on predictor group struc¬ 
ture as a given fact. 

Ordinary least squares (OLS), LASSO, SCAD, MCP, group-wise MAVE (gMAVE) and 
shrinkage group-wise MAVE (SgMAVE-LASSO, SgMAVE-SCAD and SgMAVE-MCP) 
are applied to this data set. Before applying the group-wise MAVE procedure, we need to 
determine (di, d 2 ,d^). The BIC-type criterion of Li, Li and Zhu [19] that is a modification 
of Wang and Yin [29] yields (di,d 2 ,d 3 ) = (1,1,1), indicating that each predictor group 
is connected with the response variable through a single linear combination. The same 
criterion, when the group information is ignored (p = 1 ), also shows a three-dimensional 
structure in regression. 

The corresponding coefficient estimates are shown in the second through ninth columns 
of Table 6 . Using the attribute representation (that is, representing molecules by a set 
of physicochemical attributes), these methods generate a variety of possible influences 
of structure on activity. As we can see, the substituent at position 1 should not be a 
hydrogen-bond acceptor (HAi). We can also see that both the size and the flexibility 
of the substituent at positions 1 and 3, say SZi,FLi,SZ 3 and FL 3 , are informative to 
the activity of the pyrimidines. Previous analysis of the crystal structure of the complex 
formed between trimethoprim and DHER shows that the substituents at positions 1 and 
3, are buried in a hydrophobic environment, and restrictions on size and flexibility are 
consistent with this (Hirst, King and Sternberg [15]). The shrinkage group-wise MAVE 
methods also identify the cr-effect of the substituent at position 1 (crEi) as an influencing 
factor, which is in accordance with previous studies using machine learning techniques. 
However, the ordinary variable selection methods fail to detect it. 
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Figure 1. Graphical representation of the absolute correlation matrix of the 26 predictors for 
the pyrimidine data. The magnitude of each pairwise correlation is represented by a block in 
the grayscale image. 


Let Bi £ R®, B 2 £ R® and B3 £ R® denote direction estimates for the three predictor 
groups, respectively. We next consider the group-wise additive index model 

Y = Gi{Zi) + 02 ( 22 ) + G^{Z^) + e, 

where Zi = 'Bj'Vi, 1 = 1,2 and 3, are the extracted linear predictors, and Gi(-)’s are 
unknown univariate functions. We fit this model by applying the gam function in the 
publicly available R package mgcv. The adjusted percentages of total deviance explained, 
namely the adjusted R-squared values, for various methods are summarized in the last 
row of Table 6. Unreported results show that the nonparametric smoothing of all the 
three predictors yields better performance than the additive model using smoothing of 


























































































































Table 6 . Pyrimidine data. Estimated coefficients and adjusted i?-squared values {B?) from various methods 



OLS 

LASSO 

SCAD 

MCP 

gMAVE 

SgMAVE-LASSO 

SgMAVE-SCAD 

SgMAVE-MCP 

PLi 

-0.0437 

0 

0 

Vi: attributes of a 
0 0.6240 

substituent at position 1 
0.5620 

0.6043 

0.5896 

SZi 

0.0435 

0.0358 

0 

0.0586 

-0.3154 

-0.3772 

-0.3336 

-0.3470 

FLi 

-0.0444 

-0.0323 

-0.0216 

-0.0537 

0.2890 

0.3444 

0.3035 

0.3198 

HDi 

-0.0249 

-0.0197 

-0.0480 

-0.0273 

0.1079 

0.1589 

0.1288 

0.1407 

HAi 

0.0139 

0 

0.0358 

0 

-0.0227 

0 

0 

0 

ttDi 

0.0083 

0.0025 

0.0185 

0.0207 

-0.0521 

-0.0928 

-0.0898 

-0.0916 

TtAi 

0.0122 

0 

0.0160 

0 

-0.1901 

-0.1818 

-0.1985 

-0.1949 

POi 

0.0288 

0.0224 

0.0291 

0 

-0.3498 

-0.3604 

-0.3701 

-0.3711 

aEi 

0.0386 

0.0080 

0 

0 

-0.5040 

-0.4756 

-0.4798 

-0.4749 

PL 2 

-0.0281 

0 

-0.0025 

V 2 : attributes of a 
0 0.6924 

substituent at position 2 
0.6763 

0.6894 

0.6812 

SZ 2 

0.0396 

0.0152 

0.0324 

0 

-0.1058 

-0.1023 

-0.0991 

-0.0849 

FL 2 

-0.0430 

-0.0240 

-0.0407 

-0.0179 

-0.1802 

- 0.2211 

-0.1904 

-0.2232 

HD 2 

0.0076 

0 

0 

0 

-0.0982 

-0.0556 

-0.0841 

-0.0574 

HA 2 

0.0068 

0 

0 

0 

0.0402 

0.0440 

0.0328 

0.0494 

7rD2 

-0.0132 

0.0100 

0 

0 

0.0663 

0 

0.0534 

0 

7rA2 

0.0030 

0 

0 

0 

-0.1654 

-0.1702 

-0.1672 

-0.1770 

PO 2 

0.0218 

0.0145 

0.0148 

0.0257 

-0.3264 

-0.3575 

-0.3348 

-0.3462 

0 'E 2 

0.0176 

0 

0 

0 

-0.5720 

-0.5667 

-0.5722 

-0.5673 

PL 3 

0.0103 

0 

0 

V 3 : attributes of a 
0 -0.0925 

substituent at position 3 
-0.1599 

-0.1506 

-0.1501 

SZ 3 

0.0458 

0.0579 

0.0955 

0.0740 

0.7768 

0.8369 

0.9125 

0.9126 

FL 3 

-0.0311 

-0.0079 

-0.0309 

-0.0208 

-0.1757 

-0.1957 

-0.2048 

-0.2070 

HD 3 

-0.0540 

-0.0424 

-0.0294 

-0.0340 

0.0586 

0 

0 

0 

HA 3 

0.0595 

0 

0 

0 

-0.2928 

-0.2298 

-0.2689 

-0.2667 

7rD3 

-0.1958 

-0.0696 

-0.1620 

-0.1459 

-0.3503 

-0.3490 

-0.1738 

-0.1744 

PO 3 

0.0696 

0.0365 

0.0470 

0.0623 

-0.0992 

0.0080 

0 

0 

ctEs 

0.0987 

0.0363 

0.0864 

0.0806 

0.3677 

0.2468 

0 

0 


0.8206 

0.7934 

0.8311 

0.8226 

0.9150 

0.9241 

0.9170 

0.9210 


to 

CO 


Estimation for multi-index models 





24 


T. Wang, P. Xu and L. Zhu 


every single predictor, but the improvement is not statistically significant. As we can 
see, the proposed semi-parametric methods outperform the classical parametric ones as 
they can provide a mechanism for exploring nonlinear relationships between molecular 
structure and biological activity. 

Figure 2 provides the plots of estimated index functions, using for illustration the 
shrinkage group-wise MAVE method with the minimax concave penalty (SgMAVE- 
MCP). Erom Eigures 2(a) and (b), it can be seen that G'i(-) has a linear trend, while 
6 * 2 (■) is clearly curved, indicating a nonlinear parabolic dependence of activity on the ex¬ 
tracted linear combination of attributes at the second position of substitution. It can also 
be seen from Eigure 2(c) that G^{-) is very complicated, and nonparametric smoothing 
performs poorly in areas where observations are sparse. 


4. Discussion 

In this paper, we only provide the convergence rate of the shrinkage group-wise MAVE 
estimator. It is possible to derive the limiting distribution. However, the limiting distri¬ 
bution is too complicated to be applied for inference. Thus, for the time being, we are 
frustrated by the lack of a good approximation to the limiting distribution that can be 
used to set standard errors or to carry out tests on the parameter vector. 

As remarked by Knight and Eu [16], attaching standard errors to LASSO-type esti¬ 
mators is nontrivial. They then considered using the residual-based bootstrap method to 
estimate the sampling distribution of the LASSO estimator in a multiple linear regres¬ 
sion. However, Chatterjee and Lahiri [3] showed that the conditional residual bootstrap 
distribution given the data converges to a random measure; that is, the residual bootstrap 
estimate of the LASSO distribution is inconsistent. In a subsequent paper, Chatterjee 
and Lahiri [4] proposed a modified bootstrap method, and showed that it provides a valid 
approximation to the distribution of the LASSO estimator. 

But it is unclear yet whether or not the modified bootstrap method of Chatterjee and 
Lahiri [4] can be applied to our setting. The situation is complicated by the fact that 
in semi-parametric multiple-index models we need to take into account the interaction 
between nonparametric function estimation and shrinkage direction estimation. Work 
along this line is in progress. 


Appendix 

We need the following regularity conditions: 

(Al) E\Y\^ < oo and i?||X ||2 < oo for some large fc > 0, where || • II 2 denotes the £2 
norm. 

(A2) The density function of X has a bounded second derivative; ^(XIB^X = -w) and 
E(XXT|BTx = w) have bounded derivatives with respect to w and B for B in 
a small neighborhood of B*, that is, ||Pb — Pb* II2 < C for some small C > 0 . 
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(a) (b) 



- 1.5 - 1.0 - 0.5 0.0 0.5 1.0 

B3V3 


(c) 

Figure 2. The panels show the estimates of the terms in the group-wise additive index model 
for the pyrimidine data using for illustration SgMAVE-MCP. The upper left panel, the upper 
right panel and the lower panel are the smooth functions of the extracted linear predictor in 
predictor group one, two and three, respectively. The rug plots, along the bottom of each plot, 
show the values of the predictors of each smooth. Thin plate regression splines were used with 
smoothing parameters being selected by GCV. 
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(A3) The function A(F|B^X = w) has a bounded and continuous fourth derivative 
with respect to w and B for B in a small neighborhood of B*. 

(A4) The kernel K{-) is a Gaussian probability density function. 

(A5) d < 3 andoc 

(A6) B; = B*D° + and B/ = B*D° + Op(n“^/^) for some di x di nonsin¬ 

gular matrix D° for I = 1,... ,g. 

Note that conditions (A1)-(A4) are standard in the literature, see for instance Wang and 
Xia [28], Xia [32] and Li, Li and Zhu [19]. As shown in Xia [32], the ordinary MAVE 
estimator is root-n consistent under conditions similar to (Al)-(A5). Consequently, con¬ 
dition (A6) is very reasonable because we can view B = B; as a special case of a 
general B in Xia’s proof. If higher order local polynomial smoothing is used, the root-n 
consistency can also be achieved for d > 3; see Remark 5.3 in Xia [32]. Nevertheless, in 
practice models with d > 3 are not attractive due to the “curse of dimensionality”. 

Before we begin the proof, we need to introduce some additional notation. For a positive 
integer m, 0^ stands for an m-dimensional vector of zeros. For an mi x m 2 matrix A, 
vec(A) stands for the mim 2 -dimensional vector obtained by stacking the columns of A. 
For a diagonal matrix, we get the (generalized) inverse by taking the reciprocal of each 
nonzero element on the diagonal, leaving the zeros in place, and transposing the resulting 
matrix. 

Let B; = diag(Q;;)B;. Then B = ©f,^iB/. Let Bist denote the (s,t)th element of B/. 
Without loss of generality, we assume that D]* = , and the first qi components of B*j^ 

are nonzero. For each I = 1,... ,g, we define 


Hr = 


/diag(Bri) 


Vdiag(B;^J 


{diag(B*J} and H; = 


^ diag(B/i) \ 
\diag(BzdJ/ 


{diag(Ba)} \ 


where B*j denotes the tth column of B* and B/t denotes the tth column of B/, 

t=l,...,dp Let H* = 0f^^H* and H = 0f^^ Hp By condition (A6), IJH* - HII 2 = 
Op(n“^/^). 


Proof of Theorem 2.1. We shall concentrate on the optimization problem (2.5). The 
proof follows Theorem 1 of Bondell and Li [1] closely. First, we formulate an equivalent 
optimization problem that is easier to analyze theoretically. To see this, we note that 




i=i i=i 


- d* - ^ hf^Bj diag(v^ - v))aj 




=EE 


9 

( diag(Bn) \ - 

y^-d*-^{br®(v^-vi)T} 

A r' 

1=1 

\diag(Bid,)/ 
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=EE 



9 


E{br®w 


n 2 


V;) }H;diag(Ba)a; 


Suppose that {B;i = {Bm ,..., Bip,i)^ G ,l = 1,... ,g} is the minimizer of 


EE 









PI 




IS..I 


Z=1 s=l 


\B, 


/si 


(A.l) 


with respect to {Bn = (-B/ii, ■ ■ ■,G , ^ = 1, ■ ■ ■,g}- From Definition 2.1, it is 
easy to see that ais = for all / = 1,..., g and s = 1,.. .,pi. Further, vec(Bz) = 

fi/Bzi. 

Below we shall describe the details of the proof by breaking it up into two steps. Step 
I establishes the convergence rate of B. Step II shows that B attains sparsity. 

Step I. Let u = (u^, •.., uj)^ G where u/ = (u/i,..., u/p,G for I = 1,... ,g. 
Define 


j„(u)=A„££iEir' 


1=1 s=l 


U* _i_ 

^Zsl + —i= 
yjn 


Then, we have 


J„(u) - J„(Op) = iB/sil 

1=1 s=l 


R* _i_ 

^ 




If ujs = 0, then 


^\Bisi\-^^ 

sjn 


B 


Isl 


If Uls 7 ^ 0 and B^^^ ^ 0, then \Bisi\ {B^sil and 


Uls 

\fn 
1-1 


-\BLi\ I =0. 


Bt 


Uls 


- \Btsi\ ] ^ Uls X sgii{Bl^), 


■^Isl /— 

y/n 

where sgn(-) is the sign function. By Slutsky’s theorem. 


^\'Bisi\-^V^ 

y/n 


B, 


Isl 


Uls 

y/n 


-\Bl,\]=op{h^). 


If Uls 7^ 0 and B*^^ = 0, then 
A 


y/n 


\Bisi\ ^y/n 


R* 4- 


-IB, 


Isl 


^/n\B 


\Uls 


/si 


00 . 
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Define 


i=l i=l 


y. 


-ad- ® W - ( Bri + ^ 

Z=1 ^ 


After some algebra one gets 

^'„(u)-«'„(Op) 


=EE 


2=1 j=i Lz=i 
n n 

+ 2 EE 

1=1 i=i 


Elb^ ® (W - 




i=i 






Let V* = ..., B*7and v = (Bn^, ..., B^i^)^. Then, we have 

( -( n n \ _ 

“EE’^bVil^S- Hu 

” i=ii=i / 


?-t/ 1 


+ 2u^H H{v^(v*-v)} 


i=i i=i 


= Ti+T2, 


where v^- = (v^^i,.. -.^JjgV S Rpi'^i+'-'+Ps'^s, = bj (g) (v^ - 

First, we consider Ti. Note that 


1 = 1,...,g. 


n n \ 

ri = u^H*T H-u 


i=l j=l 




i=i j=i 


-"(h-Hb" -EE 




T.r.i 1 h*U 


i=l j=l 
n n 


(H - H‘) r E E ^»PA <H - »•)“ 


i=l j=l 
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Til + Ti2 + Ti3 + Ti4. 


By Lemma 4 in Wang and Xia [28] 



n n 


yVyW} = ^ + Op(l), 


j=i 


where A is a nonnegative definite matrix. Hence, we obtain 


Tn = Op(l), ri2 = Op(n-i/2), T13 = Op(n-i/2) and ri4 = Op(n-i). 

Next, we consider T2. Note that v = v* + Op(n“^/^) and 





/ 1 \ - 

+ 2uT(H - -EE(H - H*){y^(v* - ^)} 

\”i=li=l / 

= ?21 + T 22 + T 23 + T 24 . 

Thus, we arrive at 

T'2i = 0p(l), 722 = Op(n“^/^), 723 = Op(n“^/^) and r24 = Op(n“^). 

Let L„(u) = 4'„(u) + J„(u). If 0 for some lG{l,...,g} and s G {® + 1,. 
then L„(u) — L„(Op) —J^p 00 > 0. So we assume in the sequel that u G Z7, where 

W = {u G : ujs = 0 for alH = 1,... and s = ® + 1,... ,Pi}. 

It follows that Jn(u) — J„(0p) = op{hf) for any u G Let u G 

We consider the problem of minimizing L„(u) over U. Because nh^ —>■ 00 , we obtain 

-bn(u) — L„( 0 p) = Til + 721 + op(h^). 

Let (B*, A*) be an orthogonal matrix. Let C = 0f,^i{Id; ® (B*, A*)}. Then, according to 
Lemma 4 of Wang and Xia [28], the long version, there exists a (X]f=i dipi) x (X]f=i dipi) 
permutation matrix 11 = ®f,^i B; such that 
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where h ^Aii„ -^p An, h ^Ai 2 n ~^p A 12 , h ^A 2 i„ -^p A 21 and A 22 n -^p ^ 22 - 
Moreover, both An and A 22 are positive definite. 

Write Cn = (Di,D 2 ) with Di being of order iJ2i=i^iPi) ^ (Sf=i‘^f)- Let Zi = 
D]'^H*u and Z 2 = DjH*u. Write z = {zj,zj Now consider the function 


G„(z) = z^ 


f Alin 

\ A 21 „ 


Ai 2 n 

A 22 n 


Z 


+ 2 z"^ 


f An„ 

\ A 21 n 


Ai 2 n \ 

A 22 n J 


(cn)^H*{V^(v* 


v)} +op(h^). 


Denote z = (zj,zj)^ the minimizer of G„(z). It turns out that the conditions of 
Theorem 1 of Radchenko [22] are satisfied and, consequently, we have zi = Op(l) 
and Z 2 = Op(l). Over G, because u = (H*^H*)“^H*^(Di,D 2 )z, we have u = Op(l). 
We thus conclude that there exists a minimizer {Bn,^ = l,...,g} of (A.l) such that 
||Bn - B ;,||2 = Op(n-i/2) for alU = 1,.. . ,g. 

Since vec(B*) = H*B*^ and vec(Bi) = HiBn, by triangular inequality we have 


|vec(B,) - vec(Br )||2 < ||vec(HrBa) - vec(HrB* 


iiJ 


|vec(HiBa)-vec(H;B 


i ^n)\\2- 


Therefore, ||B; — B *||2 = Op(n“^/^) for all / = 1 ,..., g. 

Step 11. We show the variable selection consistency. Write Ai =I(B*) and Ani = 
I(Bi). For any s € ^Lat is, s G Ai for some I, the estimation consistency result 

indicates that djs -^p 1. Thus, P(s G 1- If f^en suffices to show that for 

any s' ^ULi-4z, Pis'G[jLiA„i ) —0. Consider the event {s' G Ani}- By standard 
Karush-Kuhn-Tucker conditions for optimality, we know that 






i=\ j=l 
' iT 


1=1 


X [{bj O (v^ - Vi) {Hfois'] X fo* = [Sis'll sgn(Sis'i) 


7^’ 


where eis' G is the vector containing a 1 in the s'th position and zeros elsewhere. 
Note that 

I ri 1—1 

[Sis'll —p= = -^- >P 00 

^ V^jSis'll 

and 




i=i 


-a*-g{b7®(v/-77}fiiB 


1 = 1 


X [{b7 O (v{ - v[)^}Hieis'] X w] = Op(l). 
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Thus, we obtain 


p(s'G Anl I < 'y ^ P{s' € Anl) —> 0. 

V 1=1 ) 1=1 


The proof is complete. 


□ 


Proof of Theorem 2.2. According to whether the fitted model M.\ is under-fitted, 
correctly fitted or over-fitted, we can divide K'*' = [0,oo) into three disjoint parts: 

fl_ = {A:7WA^Mr}, IIq = {A : TWa = Mt} 


and 


12+ = {A : Ad A 2 Adr, Ad a AdT}- 

Further, we assume a reference sequence of tuning parameters, which satisfies 

the conditions in Theorem 2.1. Clearly, B/i = -f and P{M\„ = Adp) —>■ 1. 

We write a = ■ • ■ i = (o^ii ■ • ■: ^ define 


RSSa4 = mm EE diag(v/ -v\)oLi 


= min 

2 = 1 j=l 


- a* - O (W - diag(Bii)ai 


1=1 


w= 


where Sm = {w = (wi,..., WpY G : Ws = 0, s ^ Ad}. 

For a generic model Ad, let {Bii(Ad),... ,Bgi(Ad)} be the minimizer of 


EE 

1=1 j=i 




1^1 


Wa 


with respect to (Bj^, • ■ - ^ Then, we have 


RSS^i = EE 

i=i j=i 


-a"- EW^ ® W “ vJ)^}HiBa 


1=1 


"r 


Further, for the full model Adp, B;i (Adp) = Bq for all I = 1,..., g. 

We first consider under-fitted models, that is, AdA ^ Adp. Note that 


inf RSSa —RSSa„> inf RSb^vi^ — RSSa„ > min RSSa/(—RSSa, 


Aen 


AGn_ 


M^M-i 
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By definition, we know that 

n n 

RSSai - RSSmf = EE 

1=1 j=l 


- d* - 0 w - vi)^}SzBa {M) 




Wa 


-EE 

i=li=lL 


-d*-£{br 0 (w-vn^}M 


/=1 


= EE 

i=i j=i 


Ei^r 0 (vf - vi)^}]i[z{Ba(Af) - ki} 


1=1 


According to Lemma 4 of Wang and Xia [28], there exists some constant k > 0 such that, 
for any Xi ^ Mr, 

RSS^vi — RSSxf > Knh^ 

with probability tending to 1. Since log(l + x) > min{0.5x, log(2)} for any x > 0, we have 
log(RSSA,) - log(RSSA,^) = logfl + ~ 


> min|log(2). 


RSSa^p 
RSSai-RSSm, 
2 RSSaip 


Following an argument similar to the one used in the proof of Lemma 1 of Xia et al. [34] , 
one can show that n~^ RSSxp -^p <J^ for some cr > 0. This, together with n~^ log(n) = 
o(h^), yields that 

p\ min BICA^-BICxj,+op(d2)>0|^l. 

Because B/i = B*^ + Op(n“^/^) and B;i = B*^ + Op(n“^/^), we obtain 


RSSa„ - RSSvvfp = Op ( - ) = opih^). 


Thus, we have 


( inf BIG A - BICa„ >o) > p ( inf BICa< - BICa<^ + BICa^^ - BIG a„ > o) 

VAefl— / ^A4d^A4T A 


1 . 

Next, we consider over-fitted models, that is, A4 a O M.t but A4 a ^ M.t- Observe that 


inf RSSa —RSSa > inf RSSaIx — RSSa„ > min RSStk —RSSa„. 
Aen+ Aen+ 
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For a generic model M., ii M. M.t, then one can show that 

RSSyoi = RSS^p +0 p • 

Because RSS^p — RSSa„ = 0_p(n“^), it follows that 

min RSSa 4 — RSSa. = Op ( — y 
m^Mt,m=^Mt \n J 

Then, with probability tending to 1, we have 

inf RSSa-RSSa 
A en+ n 

As a consequence, 

p( inf BICa-BICa„ > 0 ) >pf inf RSSa - RSSa„ > 0 ) ^ 1. 

VAen+ / yAen+ n J 

Combining, the proof is complete. 


□ 
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